This dataset has information pulled from the Global Health Observatory (GHO) data repository under the World Health Organization (WHO), with additional country specific information collected by the United Nations (UN). There are 22 variables and 2938 observations in this dataset, with 20 of those variables being predicting variables.
This dataset describes life expectancy of 193 different countries, along with various other factors that may impact life expectancy. Some of these additional factors include, but are not limited to: mortality rate of adults and infants, alcohol consumption, immunizations of specific diseases, deaths by specific diseases, Gross Domestic Product, and population of the country. All of the data relates back to a specific country.
Anyone who has stayed up to date with the news will have likely heard about the recent “anti-vaxers” movement. Essentially, people are starting to believe that vaccinations actually have a negative impact on quality of life and life expectancy, and can also lead to autism. While many of these disadvantages of not being immunized, especially the connection to autism, have been disproven, many still do not get vaccinated. Certainly, everyone is entitled to their own choices and beliefs, so we are hoping that an analysis of this dataset may shed some scientific light on both sides of the arguments, right or wrong. While there have been many datasets about life expectancies, there are not many that accompany immunization statistics with it.
Furthermore, we hope, through studying and modeling the relationship between life expectancy and various factors, to identify some of the biggest factors that influence life expectancy. Through this, we could demonstrate the biggest leverage that an under-developed country could take action on fairly easily to improve the life expectancy of their people.
Import libraries to use
library(lmtest)
Load the data into R
life_expectancy <- read.csv('Life Expectancy Data.csv')
Analyze the plots between life expectancy and the rest of the predictors to do a manual selection of potentially useful predictors
pairs(life_expectancy[, c(4, 1, 2, 3, 5)], col = 'dodgerblue')
pairs(life_expectancy[, c(4, 5, 6, 7, 8)], col = 'dodgerblue')
pairs(life_expectancy[, c(4, 9, 10, 11, 12)], col = 'dodgerblue')
pairs(life_expectancy[, c(4, 13, 14, 15, 16)], col = 'dodgerblue')
pairs(life_expectancy[, c(4, 17, 18, 19)], col = 'dodgerblue')
pairs(life_expectancy[, c(4, 20, 21, 22)], col = 'dodgerblue')
Clean up source data by excluding records with NA value in the columns of interest. Also exclude records with unreasonable records. For example Income.composition.of.resources = 0 does not seem reasonable comparing to the overall distribution.
## remove rows with NA
subdata = life_expectancy[complete.cases(life_expectancy), ]
## remove rows with Income.composition.of.resources=0
subdata = subset(subdata,subdata$Income.composition.of.resources!=0)
## remove rows with BMI less than 8
subdata = subset(subdata,subdata$BMI>8)
## remove country from the data set as it is not a predictor we want to include
life_exp_clean_data = subdata[, -1]
## create a detaset that contains columns that believe have predicting power of life expectancy
life_exp_good_data = subdata[,c(2, 3, 4, 5, 7, 8, 11, 13, 15, 16, 17, 21, 22)]
Fit an additive linear regression model for all available predictors
le_full_add_model <- lm(Life.expectancy ~ ., data = life_exp_clean_data)
Fit an additive linear regression model for all predictors we manually analyzed as useful
le_add_model <- lm(Life.expectancy ~ ., data = life_exp_good_data)
Fit an interaction linear regression model up to all 2 way interactions for all available predictors
le_inter_model <- lm(Life.expectancy ~ . * ., data = life_exp_good_data)
Identify high leverage data points
full_add_lev <- hatvalues(le_full_add_model) > 2 * mean(hatvalues(le_full_add_model))
add_lev <- hatvalues(le_add_model) > 2 * mean(hatvalues(le_add_model))
inter_lev <- hatvalues(le_inter_model) > 2 * mean(hatvalues(le_inter_model))
Identify influential data points
full_add_inf <- cooks.distance(le_full_add_model) > 4 / length(cooks.distance(le_full_add_model))
add_inf <- cooks.distance(le_add_model) > 4 / length(cooks.distance(le_add_model))
inter_inf <- cooks.distance(le_inter_model) > 4 / length(cooks.distance(le_inter_model))
Refit models without high leverage or influential data poitns
le_full_add_model <- lm(Life.expectancy ~ ., data = life_exp_clean_data, subset = !full_add_lev | !full_add_inf)
le_add_model <- lm(Life.expectancy ~ ., data = life_exp_good_data, subset = !add_lev | !add_inf)
le_inter_model <- lm(Life.expectancy ~ . * ., data = life_exp_good_data, subset = !add_lev | !add_inf)
Check normality assumption for the full additive model
qqnorm(resid(le_full_add_model))
qqline(resid(le_full_add_model))
shapiro.test(resid(le_full_add_model))
##
## Shapiro-Wilk normality test
##
## data: resid(le_full_add_model)
## W = 0.98636, p-value = 3.042e-10
Check normality assumption for the manually selected additive model
qqnorm(resid(le_add_model))
qqline(resid(le_add_model))
shapiro.test(resid(le_add_model))
##
## Shapiro-Wilk normality test
##
## data: resid(le_add_model)
## W = 0.9884, p-value = 3.741e-09
Check normality assumption for the 2 way interaction model
qqnorm(resid(le_inter_model))
qqline(resid(le_inter_model))
shapiro.test(resid(le_inter_model))
##
## Shapiro-Wilk normality test
##
## data: resid(le_inter_model)
## W = 0.98403, p-value = 2.297e-11
Check equal variance assumption for the full additive model
plot(fitted(le_full_add_model), resid(le_full_add_model))
bptest(le_full_add_model)
##
## studentized Breusch-Pagan test
##
## data: le_full_add_model
## BP = 34.969, df = 20, p-value = 0.02027
hist(resid(le_full_add_model))
Check equal variance assumption for the manually selected additive model
plot(fitted(le_add_model), resid(le_add_model))
bptest(le_add_model)
##
## studentized Breusch-Pagan test
##
## data: le_add_model
## BP = 40.865, df = 12, p-value = 5.162e-05
hist(resid(le_add_model))
Check equal variance assumption for the 2 way interaction model
plot(fitted(le_inter_model), resid(le_inter_model))
bptest(le_inter_model)
##
## studentized Breusch-Pagan test
##
## data: le_inter_model
## BP = 113.37, df = 77, p-value = 0.004429
hist(resid(le_inter_model))
Perform backwards AIC to figure out which predictors are useful
le_full_add_model_sel <- step(le_full_add_model, trace = 0)
le_add_model_sel <- step(le_add_model, trace = 0)
le_inter_model_sel <- step(le_inter_model, trace = 0)
Perform an ANOVA test to see which model is preferred between the manually selected additive model and the manually selected 2 way interaction model, with an \(\alpha = .05\)
anova(le_add_model_sel, le_inter_model_sel)[2, 'Pr(>F)'] < .05
## [1] TRUE
The larger, 2 way interaction model, is preferred.
The first thing we did was create a pairs plot for every predictor as it related to life expectancy, the response we chose to use for this project. Looking at the pairs plot, we selected a series of predictors that we felt were useful to the dataset, but also predictors that we thought would be interesting to analyze. Then, we decided to fit 3 models: an additive model with all available predictors, an additive model with the predictors we manually selected, and a 2 way interaction model with the manually selected predictors, as well. We chose to do this because we knew the dataset had too many predictors to be able to accurately fit anything beyond an additive model. So, we included the full additive model for comparison, but were ultimately focused on the predictors that we manually selected.